BASIC PYTHON FOR RESEARCHERS

by Megat Harun Al Rashid bin Megat Ahmad
last updated: April 14, 2016


2. Strings and Files


2.1 Strings Manipulation

Python has many built-in classes that support the processing of text, commonly known as strings. A string is declared by quoting the text with single (') or double (") quotes.


In [1]:
'Hello World, I am an evangelical Python enthusiast'


Out[1]:
'Hello World, I am an evangelical Python enthusiast'

Just like numerical values, strings can be assigned to variables that can be used later.


In [2]:
str1 = 'Hello World'
str2 = 'I am an evangelical Python enthusiast'

In [3]:
str1


Out[3]:
'Hello World'

In [4]:
str2


Out[4]:
'I am an evangelical Python enthusiast'

Strings can be stitched together using the '+' operator...


In [5]:
sente = str1 + ", " + str2
sente


Out[5]:
'Hello World, I am an evangelical Python enthusiast'

...as well as multiplied with the '*' operator:


In [6]:
5*'hello ,'


Out[6]:
'hello ,hello ,hello ,hello ,hello ,'

The quoting of strings can also be done with triple single ($'''$) and triple double ($"""$) quotes.


In [7]:
str3 = 'I am ' + '''not ''' + "the " + """we of anyone"""

In [8]:
str3


Out[8]:
'I am not the we of anyone'

Having several kinds of quotes allows both single and double quotes to appear as part of a string.


In [9]:
'Katheline Kelly: "That is amazing, you can spell "fox"! Can you spell "dog"?"'


Out[9]:
'Katheline Kelly: "That is amazing, you can spell "fox"! Can you spell "dog"?"'

In [10]:
'''Joe Fox: "...but I am in the middle of a project \
that needs "tweaking""'''


Out[10]:
'Joe Fox: "...but I am in the middle of a project that needs "tweaking""'

You will notice there is a backslash '\' between the words "project" and "that". At the end of a line, the backslash tells Python to ignore the line break, so the string continues on the next line without a new line character being inserted. The same works for any Python statement: when a statement is long, it can be continued on the next line with '\' to improve readability.

The length of a string, i.e. the number of characters it contains, can be obtained using the **len()** function.
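As a hedged illustration of line continuation (the variable names are made up for this sketch), the backslash can also continue an ordinary statement, and parentheses give a similar effect without any backslash. The calls are written as print(...) with a single argument so that the sketch should run under both Python 2 and Python 3.

```python
# A backslash at the end of a line continues the statement on the next line.
total = 1 + 2 + 3 + \
        4 + 5

# Inside parentheses no backslash is needed; adjacent string literals
# are joined into one string.
quote = ('Joe Fox: "...but I am in the middle of a project '
         'that needs "tweaking""')

# Without the backslash, a triple-quoted string keeps the line break.
two_lines = '''first line
second line'''

print(total)
print(quote)
print(repr(two_lines))
```

The last line prints the representation of the string, which makes the retained new line character visible.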


In [11]:
len(str3)


Out[11]:
25

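Python strings support the sequence type operations, i.e. strings behave much like lists. Therefore strings can be indexed and sliced to extract characters. Unlike lists, however, strings are immutable: a character cannot be reassigned in place, so a modified string has to be built as a new string.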


In [12]:
str0 = 'There is a wisdom of the head, and \
a wisdom of the heart'

In [13]:
str0


Out[13]:
'There is a wisdom of the head, and a wisdom of the heart'

Characters in a string can be extracted by specifying a position or a range of positions. Positions can be counted from either end of the string: from the left they start at 0 and increase, whereas from the right they start at -1 and decrease. Both kinds of position can be used in a slice, which is written as [start:end]. If one of the two positions is omitted, the slice extends all the way to that end of the string. The end position is exclusive, so it must be one higher than the position of the last character wanted. Because the first position is 0, this convention means that end minus start gives the number of characters extracted, which is convenient in loops and other control flow operations.


In [14]:
str0[0]


Out[14]:
'T'

In [15]:
str0[-1]


Out[15]:
't'

In [16]:
str0[11:17]


Out[16]:
'wisdom'

In [17]:
str0[-45:-39]


Out[17]:
'wisdom'

In [18]:
str0[11:-39]


Out[18]:
'wisdom'

In [19]:
str0[-9:]


Out[19]:
'the heart'

In [20]:
str0[:-27]


Out[20]:
'There is a wisdom of the head'

In [21]:
str0[:]


Out[21]:
'There is a wisdom of the head, and a wisdom of the heart'
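The relationship between the two ways of counting can be checked directly. The following is a minimal sketch (names other than str0 are made up here) showing that a negative position k refers to the same character as position len(str0) + k, that the end position of a slice is exclusive, and that a string cannot be changed in place.

```python
str0 = 'There is a wisdom of the head, and a wisdom of the heart'
n = len(str0)

# A negative position k refers to the same character as position n + k.
same_char = str0[-1] == str0[n - 1]

# The end position is exclusive, so the slice length is end - start.
word = str0[11:17]
word_len = 17 - 11

# Strings are immutable: assigning to a position raises TypeError.
try:
    str0[0] = 't'
    changed = True
except TypeError:
    changed = False

# A modified copy is built from slices instead.
lowered = 't' + str0[1:]

print(n)
print(same_char)
print(word)
print(changed)
print(lowered)
```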

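Below is a table that shows the position of each character in the string 'Python vs. Perl', counted from the left and from the right:

| Character | P | y | t | h | o | n | (space) | v | s | . | (space) | P | e | r | l |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| Position starting from left | 0 | 1 | 2 | 3 | 4 | 5 | 6 | 7 | 8 | 9 | 10 | 11 | 12 | 13 | 14 |
| Position starting from right | -15 | -14 | -13 | -12 | -11 | -10 | -9 | -8 | -7 | -6 | -5 | -4 | -3 | -2 | -1 |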

Example 2.1: Create a new string that contains the word 'Person' by slicing and extracting characters from the string 'Python vs. Perl'.

$Answer$:


In [22]:
str_table = 'Python vs. Perl'
str_table[-4:-1] + str_table[-7] + str_table[4:6]


Out[22]:
'Person'

2.2 Printing and Formatting

Python has the **print** statement, which can print formatted output.


In [23]:
print "Hello Everyone"


Hello Everyone

In [24]:
print "Hello ","Everyone"


Hello  Everyone

In [25]:
x = 45.6
y = 98.1
print x + y


143.7

Formatted output can be printed using format specifiers and escape characters.


In [26]:
x = 45.678
y = 98.142
print '%.2f added to %.3f gives \n%.1f' % (x,y,x+y)


45.68 added to 98.142 gives 
143.8

The number and order of the format specifiers in the string must match the arguments in the parentheses that follow the '%' operator. The arguments can be numbers, strings or expressions. In the above example, the ' %.2f ' term is replaced by the value of $x$, rounded to two decimal places (as specified by the ' .2 ' term) instead of the original three. The letter ' f ' indicates that the value is formatted as a floating point number, so the argument must be numeric. The ' \n ' is an escape character; here it starts a new line, so the rest of the output is printed on the next line.

Some of the format specifiers that can be used:

| Format Specifier | Type |
|---|---|
| %c | single character |
| %d | decimal integer |
| %s | string |
| %f | floating point number |

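Some of the escape characters that can be used:

| Escape Character | Function |
|---|---|
| \t | tab |
| \n | new line |
| \\ | backslash ('\') |
| \v | vertical tab |

Note that '\s' is not an escape character in Python strings; a space is simply typed as a space.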
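To make the specifiers and escape characters above concrete, here is a minimal sketch. The values, the variable names and the file path are invented for illustration only; the print(...) calls take a single argument so the code should run under both Python 2 and Python 3.

```python
name = 'Aiko'
mark = 100
ratio = 0.8675

# %s string, %d integer, %c single character, %f floating point number
line1 = '%s scored %d (grade %c)' % (name, mark, 'A')
line2 = 'ratio = %.2f' % ratio

# \t inserts a tab, \n a new line, \\ a single backslash
line3 = 'Name\tMark\n%s\t%d' % (name, mark)
line4 = 'Tutorial2\\contents.txt'  # hypothetical Windows-style path

print(line1)
print(line2)
print(line3)
print(line4)
```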

Example 2.2: The text below lists the names of examinees and their marks (in parentheses). Print the names and the marks in two columns.

“Friya (78), Darylene (84), Femi (91), Aiko (100) and Holda (80)”.

Then calculate and print the average mark.

$Answer$:


In [27]:
str_list = 'Friya (78), Darylene (84), Femi (91), Aiko (100) \
and Holda (80)'

n1 = str_list[0:5]; n2 = str_list[12:20]; n3 = str_list[27:31]
n4 = str_list[-25:-21]; n5 = str_list[-10:-5]

# Checking whether extract/slice is correct

print '%s\t%s\t%s\t%s\t%s\n' % (n1,n2,n3,n4,n5)

m1 = str_list[7:9]; m2 = str_list[22:24]; m3 = str_list[33:35]
m4 = str_list[-19:-16]; m5 = str_list[-3:-1]

print '%s\t%s\t%s\t%s\t%s\n' % (m1,m2,m3,m4,m5)

'''Printing the results 
in two columns'''

print 'Names\t\tMarks'
print 21*'-'
print '%s\t\t%s' % (n1,m1)
print '%s\t%s' % (n2,m2)
print '%s\t\t%s' % (n3,m3)
print '%s\t\t%s' % (n4,m4)
print '%s\t\t%s' % (n5,m5)

print '\nThe average mark is %d' % ((int(m1)+int(m2)+int(m3)\
                                  +int(m4)+int(m5))/5)


Friya	Darylene	Femi	Aiko	Holda

78	84	91	100	80

Names		Marks
---------------------
Friya		78
Darylene	84
Femi		91
Aiko		100
Holda		80

The average mark is 86

In this example, '#' is placed before a one-line comment; the rest of that line is not executed. Text spanning several lines can be enclosed in triple single (''') or triple double (""") quotes. Strictly speaking this creates a string that is evaluated and discarded rather than a true comment, but it is commonly used as one. Comments improve program readability.
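One detail of the answer above deserves a hedged note: the five marks sum to 433, so the exact average is 86.6, yet the output shows 86. In Python 2 the '/' operator between two integers discards the fractional part. A minimal sketch that keeps the fraction by dividing by a float and printing with '%.1f' follows; the names marks_total and average are made up for this sketch.

```python
m1 = '78'; m2 = '84'; m3 = '91'; m4 = '100'; m5 = '80'

marks_total = int(m1) + int(m2) + int(m3) + int(m4) + int(m5)

# Dividing by 5.0 (a float) keeps the fractional part in Python 2 and 3.
average = marks_total / 5.0

print('The average mark is %.1f' % average)
```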


2.3 Input and Output Functions

Apart from writing numbers and (quoted) strings directly in the code and assigning them to variables, Python can take input from the keyboard and from files. The output can also be saved into a file. In general, the input and output functions available are:

  1. Common input functions:

    1.1. Key-in input from keyboard
    input, raw_input
    1.2. Reading from file

  2. Common output functions:

    2.1. Print on the screen
    2.2. Write to file


In [28]:
# input function: key-in integer value

int_x = input('Key-in any integer (and press Enter): ')

print 'The input value is %d' % int_x
print '8 x %d = %d' % (int_x,8*int_x)


Key-in any integer (and press Enter): 32
The input value is 32
8 x 32 = 256

In [29]:
# input function: key-in a string

str_x = input('Key-in any word (and press Enter): ')

print 'The input word is "%s"' % str_x


Key-in any word (and press Enter): "Hello"
The input word is "Hello"

The **input()** function accepts numbers and quoted strings as input, because in Python 2 it evaluates whatever is typed as a Python expression. The **raw_input()** function returns everything typed as a string, so strings do not need to be quoted.


In [30]:
int_x = raw_input("Key-in any integer (and press Enter): ")
print "The input value is %s" % int_x

str_x = raw_input("Key-in any word (and press Enter): ")
print "The input word is %s" % str_x


Key-in any integer (and press Enter): 36
The input value is 36
Key-in any word (and press Enter): Hello
The input word is Hello

Here the value of the $int\_x$ variable is actually a string and cannot be used in arithmetic. Therefore, it needs to be converted to a number.


In [31]:
int_x = int(raw_input("Key-in any integer (and press Enter): "))
print "The input value when multiply with %d is %d" % (4,int_x*4)

fx = float(raw_input("\nKey-in any floating number (and press Enter): "))
print "The floating value when divided \nwith %d is %.2f" % (8,fx/8)


Key-in any integer (and press Enter): 32
The input value when multiply with 4 is 128

Key-in any floating number (and press Enter): 42.89
The floating value when divided 
with 8 is 5.36
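The conversions above fail with a ValueError if the typed text is not a number. The following is a minimal sketch of one way to guard against that. The helper to_number is hypothetical (it is not part of Python), and a list of sample strings stands in for what raw_input() would return, so that the sketch runs without keyboard input.

```python
def to_number(text):
    """Return text as an int or float, or None if it is not a number.

    to_number is a hypothetical helper written for this sketch.
    """
    try:
        return int(text)
    except ValueError:
        pass
    try:
        return float(text)
    except ValueError:
        return None

# These strings stand in for what raw_input() would return.
for typed in ['32', '42.89', 'Hello']:
    value = to_number(typed)
    if value is None:
        print('"%s" is not a number' % typed)
    else:
        print('"%s" divided by 8 is %.2f' % (typed, value / 8.0))
```

Trying int() first keeps whole numbers as integers; float() is attempted only when that fails.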

2.4 Reading and Writing File

Reading from and writing to a file can be done by using the **open()** function.


In [32]:
# Opening a file 
file_read = open("Tutorial2/les miserables.txt")
# file_read = open("Tutorial2/les miserables.txt","r")
file_read.close()

Here the **open()** function takes the quoted name of the file in parentheses. By default a file is opened for reading, i.e. in the 'r' mode, which is usually omitted. The file should be closed after it has been read. In the example below, the $les miserables.txt$ file is opened for reading (with the reading mode specified) and its content is printed. If the file is not in the same directory as the working notebook, the directory path of the file needs to be specified.


In [33]:
file_read = open("Tutorial2/les miserables.txt","r")

print 'Name of the file: %s\n' % file_read.name

# Read and print the whole file as strings
texts = file_read.read()
print texts

# Closing the file
file_read.close()


Name of the file: Tutorial2/les miserables.txt

Preface from Les Miserables

So long as there shall exist, by reason of law and custom,
a social condemnation, which, in the face of civilisation,
artificially creates hells on earth, and complicates a
destiny that is divine, with human fatality; so long as the
three problems of the age - the degradation of man by poverty,
the ruin of woman by starvation, and the dwarfing of childhood
by physical and spiritual night - are not solved; so long as,
in certain regions, social asphyxia shall be possible; in other
words, and from a yet more extended point of view, so long as
ignorance and misery remain on earth, books like this cannot
be useless.

HAUTEVILLE HOUSE, 1862.
FANTINE

In the following example, the content of the file is read as strings and assigned to a variable, so that excerpts from the file can be extracted.


In [34]:
file_read = open("Tutorial2/les miserables.txt","r")

texts = file_read.read()

'''printing the number of characters 
(including empty spaces) in the file'''
print len(texts)

# Extracting certain portion of the file
excerpt =  texts[290:419]
print excerpt

file_read.close()


681
the degradation of man by poverty,
the ruin of woman by starvation, and the dwarfing of childhood
by physical and spiritual night

The extracted excerpt can be written into another file by opening that file in writing mode 'w'.


In [35]:
file_write = open("Tutorial2/contents.txt","w")
file_write.write("Important points from Victor Hugo's Les Miserables:\n\n")
file_write.write(excerpt)
file_write.close()

In [36]:
# Reading back the written file

file_new = open("Tutorial2/contents.txt","r")
print file_new.read()
file_new.close()


Important points from Victor Hugo's Les Miserables:

the degradation of man by poverty,
the ruin of woman by starvation, and the dwarfing of childhood
by physical and spiritual night
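As an alternative to calling close() explicitly, Python also offers the with statement, which closes the file automatically at the end of the block. This is a minimal sketch rather than the method used in this tutorial: a throwaway file created with the standard tempfile module stands in for the files in Tutorial2, and the file name sample.txt is made up.

```python
import os
import tempfile

# A throwaway file stands in for the files in Tutorial2.
folder = tempfile.mkdtemp()
path = os.path.join(folder, 'sample.txt')

with open(path, 'w') as file_write:
    file_write.write('Preface from Les Miserables\n')

# The file is closed automatically when the with block ends,
# even if an error occurs inside it.
with open(path, 'r') as file_read:
    texts = file_read.read()

print(texts)
print(file_read.closed)
```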

Some important file opening modes:

| Mode | Function |
|---|---|
| r | Read only, pointer at the beginning of the file, default mode |
| w | Write only, creates the file or overwrites an existing one |
| r+ | Read and write, pointer at the beginning of the file |
| a | Append, pointer at the end of the file |
| a+ | Append and read, pointer at the end of the file |
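The difference between the 'w' and 'a' modes is easiest to see by writing to the same file several times. The following is a minimal sketch under the assumption that a temporary file (created with the standard tempfile module, with the made-up name modes.txt) is an acceptable stand-in for a real data file.

```python
import os
import tempfile

# A throwaway file, so nothing in Tutorial2 is touched.
path = os.path.join(tempfile.mkdtemp(), 'modes.txt')

# 'w' creates the file, or empties it if it already exists.
f = open(path, 'w')
f.write('first\n')
f.close()

# Opening with 'w' again discards the previous content.
f = open(path, 'w')
f.write('second\n')
f.close()

# 'a' keeps the content and adds to the end.
f = open(path, 'a')
f.write('third\n')
f.close()

f = open(path, 'r')
content = f.read()
f.close()

print(content)
```

Only the lines "second" and "third" remain, because the second open() in 'w' mode erased "first".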

Exercise 2.1: Write the previously extracted excerpt as formatted points, with the additions shown below, into a file named "Answer_to_2_1.txt":

The characters those affected by society environments in Victor Hugo's Les Miserables:

1) The degradation of man by poverty.

-   Jean Valjean

2) The ruin of woman by starvation.

-   Fantine 

3) The dwarfing of childhood by physical and spiritual night.

-   Cosette

$Answer$:


In [37]:
file_read = open("Tutorial2/les miserables.txt","r")

texts = file_read.read()

# Extracting certain portion of the file
excerpt =  texts[290:419]
# print excerpt # if you want to check the excerpt first

file_read.close()

excerpt1 = excerpt[0:33]
excerpt2 = excerpt[35:66]
excerpt3 = excerpt[68:130]
excerpt3_1 = excerpt3[0:29]
excerpt3_2 = excerpt3[30:]

file_write = open("Tutorial2/Answer_to_2_1.txt","w")
file_write.write("The characters those affected by society \
environments\nin Victor Hugo's Les Miserables:\n")
file_write.write('\n1.\t'+excerpt1)
file_write.write('\n\t-\tJean Valjean\n')
file_write.write('\n2.\t'+excerpt2)
file_write.write('\n\t-\tFantine\n')
file_write.write('\n3.\t'+excerpt3_1+' '+excerpt3_2)
file_write.write('\n\t-\tCosette')

file_write.close()

# Checking back by reading the written file

file_new = open("Tutorial2/Answer_to_2_1.txt","r")
print file_new.read()
file_new.close()


The characters those affected by society environments
in Victor Hugo's Les Miserables:

1.	the degradation of man by poverty
	-	Jean Valjean

2.	the ruin of woman by starvation
	-	Fantine

3.	and the dwarfing of childhood by physical and spiritual night
	-	Cosette

2.5 The split and replace functions

Python has regular expression libraries, the native **re** and the third-party **regex**, that can perform sophisticated text processing. Regular expressions are a very large topic and will not be discussed here.

Further information on Python **regex** library can be found in https://pypi.python.org/pypi/regex whereas the Python native **re** library can be found in https://docs.python.org/2/library/re.html

Instead, users can focus on the **split()** and **replace()** functions, which help a great deal with text processing. They are string methods and not part of the Python regular expression libraries.


In [38]:
line = 'Oz: "Greatness?"; Glenda: "No, better than that, goodness"'

In [39]:
line_list = line.split(";")
line_list


Out[39]:
['Oz: "Greatness?"', ' Glenda: "No, better than that, goodness"']

In the above instance, the content of $line$ is split into separate elements at each ";" and assigned to the $line\_list$ variable. The variable $line\_list$ is now a list containing two elements, separated by a comma (lists are explored further in the Sequence topic).


In [40]:
line_list[0]


Out[40]:
'Oz: "Greatness?"'

In [41]:
line_list[1]


Out[41]:
' Glenda: "No, better than that, goodness"'
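Notice that the second element keeps the space that followed the ";". A minimal sketch for trimming it, assuming the built-in string method strip() (which removes leading and trailing whitespace) is acceptable here; the names parts, speaker1 and speaker2 are made up for this sketch.

```python
line = 'Oz: "Greatness?"; Glenda: "No, better than that, goodness"'

# Split on ";" and trim the spaces around each piece.
parts = line.split(';')
speaker1 = parts[0].strip()
speaker2 = parts[1].strip()

print(speaker1)
print(speaker2)
```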

In [42]:
line_list = line.split(":")
line_list


Out[42]:
['Oz', ' "Greatness?"; Glenda', ' "No, better than that, goodness"']

In [43]:
line_list[2]


Out[43]:
' "No, better than that, goodness"'

If the **split()** function is used without any argument, then the string in $line$ is split at whitespace (spaces, tabs and new lines).


In [44]:
line_list = line.split()
print line_list


['Oz:', '"Greatness?";', 'Glenda:', '"No,', 'better', 'than', 'that,', 'goodness"']

Splitting on several markers at once can be done by first replacing each marker with a single common marker (or removing it altogether) and then executing the **split()** function. The replacement can be carried out using the **replace()** function.


In [45]:
line_list = line.replace(":","")\
.replace(";","").replace(",","").\
replace('"',"").replace("?","").split()
print line_list


['Oz', 'Greatness', 'Glenda', 'No', 'better', 'than', 'that', 'goodness']
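The chain of replace() calls above can also be written as a loop over the markers. This is a sketch of the same idea, not a different technique; the helper strip_markers is hypothetical and uses a for loop and a function definition, constructs that are not covered in this section.

```python
def strip_markers(text, markers):
    """Remove every character in markers from text, then split on spaces.

    strip_markers is a hypothetical helper written for this sketch.
    """
    for marker in markers:
        text = text.replace(marker, '')
    return text.split()

line = 'Oz: "Greatness?"; Glenda: "No, better than that, goodness"'
words = strip_markers(line, ':;,"?')
print(words)
```

Adding another marker then only requires extending the string passed as the second argument.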

The **replace($x$,$y$)** function receives two arguments, $x$ and $y$: every occurrence of the string $x$ is replaced by the string $y$, and the result is returned as a new string.


In [46]:
no_ser="1\t2\t3\t4\n5\t6\t7\t8\n"
print no_ser
no_list = no_ser.replace("\t"," ").replace("\n"," ").split()
print no_list


1	2	3	4
5	6	7	8

['1', '2', '3', '4', '5', '6', '7', '8']
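Since split() without an argument already treats tabs and new lines as separators, the two replace() calls in the last example are optional. The sketch below illustrates this and, as an assumption about what a reader might want next, converts the elements to integers and recovers the rows. It uses list comprehensions and splitlines(), which are not covered in this section; the names numbers and rows are made up.

```python
no_ser = "1\t2\t3\t4\n5\t6\t7\t8\n"

# split() with no argument already treats tabs and newlines as separators.
no_list = no_ser.split()

# The elements are still strings; int() turns each one into a number.
numbers = [int(item) for item in no_list]

# splitlines() keeps the row structure of the original text.
rows = [row.split('\t') for row in no_ser.splitlines()]

print(no_list)
print(numbers)
print(rows)
```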